The PHASAR Search Engine

Authors

  • Cornelis H. A. Koster
  • Olaf Seibert
  • Marc Seutter
Abstract

This article describes the rationale behind the PHASAR system (Phrase-based Accurate Search And Retrieval), a professional Information Retrieval and Text Mining system under development for the collection of information about metabolites from the biological literature. The system is generic in nature and applicable (given suitable linguistic resources and thesauri) to many other forms of professional search. Instead of keywords, the PHASAR search engine uses Dependency Triples as terms. Both the documents and the queries are parsed, transduced to Dependency Triples and lemmatized. Queries consist of a set of Dependency Triples, whose elements may be generalized or specialized in order to achieve the desired precision and recall. In order to help in interactive exploration, the search process is supported by document frequency information from the index, both for terms from the query and for terms from the thesaurus. The professional retrieval process as found e.g. in Bio-Informatics can be divided into two phases, viz. Search and Analysis. The search phase is needed to find the relevant documents (achieving recall), the analysis phase for identifying the relevant information (achieving precision). Analysis can be performed visually and manually, partly automatically and partly interactively. Both processes are complicated by the fact that the searcher has to guess what words will be used in the documents to express the relevant facts about the topic of interest. For the search process, many online databases are available, but plowing through the many hits provided by a word-based search system is hard work. Analysis is a heuristic process, demanding all the searcher's knowledge, experience and skills, for which presently only primitive keyword- or pattern-based support is available.
The searcher may become very skilled in search and analysis, but the process is only partially automated, and when searching for another topic the whole manual process must be repeated. In this paper we describe the ideas behind a novel search engine based on linguistically derived phrases, designed to support the search and analysis processes involved in retrieving information from the biomedical literature. In section 1 we describe our approach to searching, which combines linguistic techniques with Information Retrieval. In section 2 we illustrate a possible way-of-working by means of an example. In section 3 we briefly describe the status of the implementation of PHASAR and the Medline/P collection used in the project. Finally, in section 4 we reflect on what has been achieved and what challenges are still ahead.

1 The PHASAR approach

Many years of TREC challenges and evaluations have produced diminishing improvements in the precision and recall of best-practice query-based retrieval systems. In particular, the use of phrases in query-based retrieval has never led to the hoped-for breakthrough in accuracy. Due to the ubiquity of word-based search engines like Lycos and Google, searchers have come to consider a combination of a few words as a natural and precise way to formulate a query, and they have learnt to cope with the deluge of hits that the query may cause by looking at only a few of them. In designing the PHASAR system, we tried to "rethink" the support of phrase-based search and retrieval and the corresponding way-of-working.

1.1 Professional search

Professional search may be distinguished from what could be termed incidental search by the following characteristics:

1. the search is performed by professionals, in their own area of competence
2. it is worth investing some (expensive) time and effort
3. the search is over a very large collection of documents, including many which may be relevant
4. the information need is clear but complex, and the user can recognize relevant answers
5. the information need may have to be answered by gathering (passages from) many documents
6. repetitions of the search process with small modifications in the query are routine.

Contrast this with incidental search, which may have a vicarious and opportunistic character; where the searcher may easily be side-tracked; where out of a million hits only the first ten are ever considered; and where the main problem lies in formulating the information need, which can often be answered by a single document.

1.2 From specifying the question to specifying the answer

In the traditional view of Information Retrieval, the searcher has to formulate his information need, and the task of the search algorithm is to find those documents (best) answering that need. Taking this approach too literally may lead to a quest for complete, consistent and formal specifications of the information need, which are not only hard to construct but for which no effective search algorithm exists short of a reasoning agent with a complete model of the world. This works well only for severely limited application areas (no world model needed) and for limited query formalisms (where a weak but efficient inference mechanism suffices, like the Vector Space Model). We take a different point of view: the searcher has to indicate what formulations in the documents are expected to be relevant to his information need. He has to guess what the answers will look like, rather than the question (a form of query-by-example).

1.3 Dependency Triples as terms

The Information Retrieval community has long had high expectations of the use of phrases as index terms instead of keywords, but in practice the benefits were found hard to realize (see [Strzalkowski, 1999], in particular [Sparck Jones, 1999]). Rather than single words, word pairs [Fagan, 1988; Strzalkowski, 1995] or sequences of words (word n-grams, chunks, complete noun phrases), we shall use dependency triples as terms in the document representation and the search process. A dependency triple (DT) is a pair of (lemmatized) words (the head and the modifier) occurring in the document in a certain relation, which we denote by [head RELATION modifier]. Dependency triples (also known as dependency relations) have already been used successfully in Question Answering [Bouma et al., 2005; Cui et al., 2005] for the precise matching of answers from a database to questions, but we propose to use DTs directly in a search interface querying large collections of free text without special preprocessing. They are closely related to the Index Expressions of [Grootjen and van der Weide, 2004]. The following table illustrates the major syntactic dependency relations.
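The idea of querying with dependency triples whose elements may be generalized can be sketched as follows. This is an illustrative toy, not the PHASAR implementation: the corpus, relation names (SUBJ, OBJ) and the "*" wildcard syntax are invented for this example, and the documents stand in for the output of a real dependency parser with lemmatization.

```python
# Toy "parsed and lemmatized" corpus: each document is the set of
# dependency triples (head, RELATION, modifier) a parser would produce.
DOCS = {
    "d1": {("enzyme", "SUBJ", "catalyze"), ("catalyze", "OBJ", "reaction")},
    "d2": {("protein", "SUBJ", "catalyze"), ("catalyze", "OBJ", "reaction")},
    "d3": {("enzyme", "SUBJ", "inhibit"), ("inhibit", "OBJ", "pathway")},
}

def matches(query_dt, doc_dt):
    """A query DT matches a document DT element-wise; '*' generalizes."""
    return all(q == "*" or q == d for q, d in zip(query_dt, doc_dt))

def search(query):
    """Return the documents containing a match for every DT in the query."""
    return sorted(
        doc_id for doc_id, dts in DOCS.items()
        if all(any(matches(q, dt) for dt in dts) for q in query)
    )

def document_frequency(query_dt):
    """Number of documents containing a matching DT; PHASAR shows such
    frequencies next to query terms to support interactive refinement."""
    return sum(any(matches(query_dt, dt) for dt in dts) for dts in DOCS.values())

# A specific query: which documents say an enzyme catalyzes something?
print(search([("enzyme", "SUBJ", "catalyze")]))       # ['d1']
# Generalizing the head broadens recall to anything that catalyzes:
print(search([("*", "SUBJ", "catalyze")]))            # ['d1', 'd2']
print(document_frequency(("*", "SUBJ", "catalyze")))  # 2
```

Specializing works in the opposite direction: replacing a wildcard by a thesaurus term (or a more specific lemma) narrows the match set, which is the precision/recall dial the abstract describes.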


Publication date: 2006